Lecture 3: Reproducible Research

David Shilane

2019-02-07

Analytical Reporting

The Scientific Method

Reproducibility is the Gold Standard of Science

This All Used to be Easy

Complex Data, Complex Analyses

Greater Barriers to Entry, Too

We now have greater opportunities than ever to work with large and diverse data sets. We also have greater reasons for skepticism in viewing the results described in reports.

Reproducible Research, Reproducible Reports

What is a Reproducible Report?

A Simple Project

Example: A homework assignment or small research report.

But It Probably Looks Like This

Moderate Complexity

Complex Analyses and Organizations

A Million Excuses

Creating reproducible processes, reports, and systems can be very challenging. There are always many reasons why it might be difficult to implement.

The Possible Results

However, a lack of accountability in the process can lead to major problems:

More Severe Results

We All Make Mistakes

Fixing the Process

Changing the Process Starts with You

Reproducibility is a Key to Productivity

So Let’s Create Reproducible Reports

Reproducible Reporting in R

Reproducibility Basics

Example: A simulated file of asthma data with 1 million patients.

dat <- fread("simulated asthma data.csv")
geo.group <- "Urban"
mean.ER.1yr.Urban <- round(x = dat[geog == geo.group, mean(ER_1yr)], digits = 2)
sd.ER.1yr.Urban <- round(x = dat[geog == geo.group, sd(ER_1yr)], digits = 2)
n <- dat[geog == geo.group, .N]

For patients who live in urban areas, the mean number of visits to the ER in the first year was 0.82 with a standard deviation of 1.06 on a sample size of 499311.

How I Actually Wrote This

Reproducibility

The Anatomy of a Report

Advantages of Reproducible Reporting

Reports are Complex

RMarkdown: Not Just for R

Program in any combination of these languages:

Output to any of these formats:

Run it all in RStudio with R in the background.

Creating Applications

Produce a full range of content:

RMarkdown builds on earlier designs (Shiny) in a more user-friendly way. It is designed for reproducible reporting.

GUIs

Produce a full range of web elements:

RMarkdown provides excellent opportunities to create customized, interactive reports.

Examples

Getting Started with RMarkdown

Code Chunks

Delineating Code Chunks

Code Chunks: Language and Identifier

Code Chunks – Parameters

Code Chunks – The echo Parameter

Examples of the echo Parameter:

x <- 3
y <- 5 * x
this.dat <- data.table(x, y)
datatable(data = this.dat, rownames = FALSE)

Code Chunks – The eval Parameter

Examples of the eval Parameter

x <- 3
y <- 5 * x
this.dat <- data.table(x, y)
datatable(data = this.dat, rownames = FALSE)
x <- 3
y <- 5 * x
this.dat <- data.table(x, y)
datatable(data = this.dat, rownames = FALSE)

Think of the Permutations:

Code Chunks – More Parameters

A few other notable parameters include:

I also frequently use parameters to specify the formatting of images.

Setting Defaults for the Parameters of Code Chunks

library(knitr)
opts_chunk$set(echo = TRUE, comment = "", warning = FALSE, message = FALSE, 
    tidy.opts = list(width.cutoff = 55), tidy = TRUE)

Strategies for Code Chunk Parameters

Code in the Exposition

How Complex Can Code in the Exposition Be? Should It Be?

The Outlines of a Report

Running Your Code

Advantages of RMarkdown

Now You’re Ready To Start Writing Reports!

RMarkdown Templates

Advanced Strategies for RMarkdown

Let’s explore each of these issues further

Reports for Other Users

Specifying Report-Level Parameters

Steps for Using Report-Level Parameters

Example: User-Specified Report-Level Parameters

## Set the working directory to the Source File Location.
library(rmarkdown)
report.period <- "Q1 2019"
beginning.date <- "2019-01-01"
ending.date <- "2019-03-31"
params <- list(report.period = report.period, beginning.date = beginning.date, 
    ending.date = ending.date)
output_file <- sprintf("Quarterly Report -- %s.html", report.period)
render(input = "Quarterly Report.Rmd", output_file = output_file, 
    params = params)

A Major Advantage of the render Function

Strategic Use of render

render(input = "Quarterly Report.Rmd", output_file = output_file, 
    params = params)
output_file_executive_summary <- sprintf("Quarterly Report -- %s -- Executive Summary.html", 
    report.period)
render(input = "Quarterly Report -- Executive Summary.Rmd", 
    output_file = output_file_executive_summary, params = params)

Enormous Flexibility and Capacity

Time for a Break

There’s More to Reproducibility, Though

Writing a reproducible report with a paradigm like RMarkdown can be very effective. However, the scope of reproducibility encompasses other processes in model development. It’s good to give some thought to these additional issues, which include:

Let’s discuss each of these topics in greater detail.

Model Change Management

Version Control

Options for Version Control

Implementing version control is a recommended practice for all engineering environments that create production-level code.

Full Reproducibilty Requires Great Coding Designs

What’s Left Unsaid in All of This

With experiene, you’ll increasingly recognize the challenges of this work and proactively design solutions. However, this is an ongoing process.

So Let’s Build a Reproducible Report!

Case Study: A Real World Clinical Trial for HIV Medicines

We now have the tools and basic methods of analysis to do some real work in health care. So let’s take a look at a real clinical trial of a medicine that is used to treat HIV.

Overview of HIV and AIDS

Treating HIV

A Clinical Trial

Gulick et al. conducted a randomized clinical trial:

Measuring Outcomes

The 3-Drug Cocktail Trial’s Data

Getting Started with the Analysis

Preliminaries for the Report

Typical Code Chunks

Typical Sections

Now Let’s Build the Analysis

Reading in the Data

## code chunk: constants
file.hiv.data <- "AIDS data.csv"
## code chunk: read_data
dat <- fread(input = file.hiv.data)

## Entered in console but not in the file directly:
dim(dat)
[1] 1151   16
names(dat)
 [1] "id"       "time"     "censor"   "time_d"   "censor_d" "tx"      
 [7] "txgrp"    "strat2"   "sex"      "raceth"   "ivdrug"   "hemophil"
[13] "karnof"   "cd4"      "priorzdv" "age"     

Adding Variable Names

## code chunk: constants
id.name <- "id"
time.aids.or.death.name <- "time"
time.death.name <- "time_d"
outcome.aids.or.death.name <- "censor"
outcome.death.name <- "censor_d"
treatment.name <- "tx"
age.name <- "age"
sex.name <- "sex"

Understanding the Data

## code chunk: data_investigations
dat[, .(n = .N, `Number of Unique IDs` = length(unique(get(id.name))))]
      n Number of Unique IDs
1: 1151                 1151
dat[, summary(get(time.aids.or.death.name))]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   174.0   257.0   230.2   300.0   364.0 
dat[, summary(get(time.death.name))]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   194.5   265.0   242.3   306.0   364.0 
dat[, .N, keyby = outcome.aids.or.death.name]
   censor    N
1:      0 1055
2:      1   96

Understanding the Data, Part II

## code chunk: data_investigations
dat[, .N, keyby = outcome.death.name]
   censor_d    N
1:        0 1125
2:        1   26
dat[, .N, keyby = treatment.name]
   tx   N
1:  0 577
2:  1 574
dat[, summary(get(age.name))]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.00   33.00   38.00   38.65   44.00   73.00 
dat[, .N, keyby = sex.name]
   sex   N
1:   1 951
2:   2 200

A Nice Feature

Plans for Data Cleaning

Data Cleaning

## code chunk:  constants
male.name <- "Male"
sex.category.male <- 1
sex.category.female <- 2
## code chunk:  data_cleaning
dat[, eval(male.name) := 1*(get(sex.name) == sex.category.male)]
## Then, in the Console, check the values with this line:
dat[, .N, keyby = c(sex.name, male.name)]
   sex Male   N
1:   1    1 951
2:   2    0 200

An Overall Plan for the Analysis

Sample Sizes

## code chunk: constants
received.treatment <- 1
no.treatment <- 0
## code chunk: analyze_data
n <- dat[, .N]
n.treatment <- dat[get(treatment.name) == received.treatment, 
    .N]
n.untreated <- dat[get(treatment.name) == no.treatment, 
    .N]

Reporting Sample Sizes

Splitting the analyze_data Code Chunk

Summarizing Demographics by Treatment Group: Male

Let’s calculate the number and percentage of males in each treatment group:

## code chunk: functions
count.and.percentage <- function(dat, variable.name, by.names, 
    na.rm = TRUE) {
    require(data.table)
    setDT(dat)
    
    tab <- dat[, .(C1 = sum(get(variable.name), na.rm = na.rm), 
        N = .N, P1 = 100 * mean(get(variable.name), na.rm = na.rm)), 
        by = by.names]
    setnames(x = tab, old = c("C1", "P1"), new = c(sprintf("%s: Count", 
        variable.name), sprintf("%s: Percentage", variable.name)))
    return(tab)
}
## code chunk: summarize_demographics_by_treatment
tab.male <- count.and.percentage(dat = dat, variable.name = male.name, 
    by.names = treatment.name)
datatable(data = tab.male, rownames = FALSE)

What About Rounding?

## code chunk: constants
one.digit <- 1
## code chunk: functions
round.numerics <- function(x, digits) {
    if (is.numeric(x)) {
        x <- round(x = x, digits = digits)
    }
    return(x)
}
## code chunk: summarize_demographics_by_treatment
datatable(data = tab.male[, lapply(X = .SD, FUN = "round.numerics", 
    digits = one.digit)], rownames = FALSE)

Summarizing Age by Treatment Group

## code chunk: functions
mean.and.sd <- function(dat, variable.name, by.names, na.rm = TRUE) {
    require(data.table)
    setDT(dat)
    
    tab <- dat[, .(M1 = mean(get(variable.name), na.rm = na.rm), 
        S1 = sd(get(variable.name), na.rm = na.rm)), by = by.names]
    setnames(x = tab, old = c("M1", "S1"), new = c(sprintf("%s: Mean", 
        variable.name), sprintf("%s: SD", variable.name)))
    return(tab)
}
## code chunk: summarize_demographics_by_treatment
tab.age <- mean.and.sd(dat = dat, variable.name = age.name, 
    by.names = treatment.name)
datatable(data = tab.age[, lapply(X = .SD, FUN = "round.numerics", 
    digits = one.digit)], rownames = FALSE)

Adding Some Exposition

The other points below were not included, but they are nonetheless good to think about in analyzing the data from a randomized trial:

Analyzing the Outcomes

Percentage of AIDS or Death Outcomes by Treatment Group

Let’s add a code chunk called outcomes_percentages:

## code chunk: outcomes_percentages
tab.pct.aids.or.death <- count.and.percentage(dat = dat, 
    variable.name = outcome.aids.or.death.name, by.names = treatment.name)
datatable(data = tab.pct.aids.or.death[, lapply(X = .SD, 
    FUN = "round.numerics", digits = one.digit)], rownames = FALSE)

Percentage of Death Outcomes by Treatment Group

## code chunk: outcomes_percentages
tab.pct.death <- count.and.percentage(dat = dat, variable.name = outcome.death.name, 
    by.names = treatment.name)
datatable(data = tab.pct.death[, lapply(X = .SD, FUN = "round.numerics", 
    digits = one.digit)], rownames = FALSE)

Mean Time to AIDS or Death Among Patients with Adverse Outcomes

Let’s add a code chunk called event_times:

## code chunk: event_times
tab.time.aids.or.death <- mean.and.sd(dat = dat, variable.name = time.aids.or.death.name, 
    by.names = treatment.name)
datatable(data = tab.time.aids.or.death[, lapply(X = .SD, 
    FUN = "round.numerics", digits = one.digit)], rownames = FALSE)

Mean Time to Death Among Patients with Adverse Outcomes

## code chunk: event_times
tab.time.death <- mean.and.sd(dat = dat, variable.name = time.death.name, 
    by.names = treatment.name)
datatable(data = tab.time.death[, lapply(X = .SD, FUN = "round.numerics", 
    digits = one.digit)], rownames = FALSE)

Adding Some Conclusions

And That’s a Reproducible Report